A system for creating and manipulating generalized wordclass transition matrices from large labelled text-corpora
Authors
Abstract
This paper deals with the training phase of a Markov-type linguistic model that is based on transition probabilities between pairs and triplets of syntactic categories. To determine the optimal level of detail for a set of syntactic classes we developed a system that uses a set-theoretical formalism to define such sets and has some measures to compare and optimize them individually. In section two we describe the optimization problem (in terms of prediction, information and economy requirements) and our approach to its solution. Section three introduces the system that will assist a linguist in handling the prediction and economy criteria, and in the last section we present some sample results that can be achieved with it.
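The transition probabilities between pairs and triplets of syntactic categories mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's system: the function name `transition_probs`, the toy tag names, and the `classes` dictionary (a hypothetical way of defining a coarser wordclass as a set of fine-grained tags) are all assumptions made for illustration, and the estimate is a plain relative frequency with no smoothing.

```python
from collections import Counter


def transition_probs(tag_sequences, order=2, classes=None):
    """Estimate P(last class | preceding order-1 classes) by relative frequency.

    classes: optional dict mapping a class name to a set of fine-grained tags
    (hypothetical; tags not covered by any class are kept unchanged).
    """
    # Invert the set-based class definitions into a tag -> class lookup.
    lookup = {}
    for name, members in (classes or {}).items():
        for tag in members:
            lookup[tag] = name

    ngram_counts = Counter()
    context_counts = Counter()
    for seq in tag_sequences:
        mapped = [lookup.get(tag, tag) for tag in seq]
        for i in range(len(mapped) - order + 1):
            ngram = tuple(mapped[i:i + order])
            ngram_counts[ngram] += 1
            context_counts[ngram[:-1]] += 1

    return {ng: c / context_counts[ng[:-1]] for ng, c in ngram_counts.items()}


# Toy labelled corpus (invented data, tags only).
corpus = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB"],
]

pairs = transition_probs(corpus, order=2)      # transitions between pairs
triplets = transition_probs(corpus, order=3)   # transitions between triplets

# Merging two tags into one class gives a smaller, coarser matrix.
coarse = transition_probs(corpus, order=2, classes={"NOMINAL": {"NOUN", "ADJ"}})
```

Comparing `pairs` with `coarse` shows the trade-off the abstract alludes to: the coarser class set has fewer matrix entries to store and estimate, at the cost of distinguishing fewer categories.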
Similar resources
Extracting a parallel corpus from comparable documents to improve translation quality in machine translation systems
Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides its own merits, text classification (TC) has become a cornerstone in many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
Automated Verb Sense Labelling Based on Linked Lexical Resources
We present a novel approach for creating sense annotated corpora automatically. Our approach employs shallow syntacticosemantic patterns derived from linked lexical resources to automatically identify instances of word senses in text corpora. We evaluate our labelling method intrinsically on SemCor and extrinsically by using automatically labelled corpus text to train a classifier for verb sens...
Robust H_∞ Controller design based on Generalized Dynamic Observer for Uncertain Singular system with Disturbance
This paper presents a robust H_∞ controller design, based on a generalized dynamic observer, for uncertain singular systems in the presence of disturbance. The controller guarantees that the closed-loop system is admissible. The main advantage of this method is that the uncertainty can be found in the system, the input and the output matrices. Also the generalized dynamic observer is used to est...
Creating a Multilingual Collocation Dictionary from Large Text Corpora
This paper describes a system of terminological extraction capable of handling multi-word expressions, using a powerful syntactic parser. The system includes a concordancing tool enabling the user to display the context of the collocation, i.e. the sentence or the whole document where the collocation occurs. Since the corpora are multilingual, the system also offers an alignment mechanism for t...
Journal title:
Volume, issue:
Pages: -
Publication date: 1988